arXiv: 1507.07106v4 [stat.ME] 18 Jan 2017 


Scalable Bayesian Variable Selection Using Nonlocal 
Prior Densities in Ultrahigh-dimensional Settings 

Minsuk Shin* Anirban BhattacharyaJ and Valen E. Johnson^ 
Department of Statistics, Texas A&M University, Texas, U.S.A 

Abstract 

Bayesian model selection procedures based on nonlocal alternative prior densities arc ex¬ 
tended to ultrahigh dimensional settings and compared to other variable selection procedures 
using precision-recall curves. Variable selection procedures included in these comparisons in¬ 
clude methods based on p-priors, reciprocal lasso, adaptive lasso, scad, and minimax concave 
penalty criteria. The use of precision-recall curves eliminates the sensitivity of our conclusions 
to the choice of tuning parameters. We find that Bayesian variable selection procedures based 
on nonlocal priors arc competitive to all other procedures in a range of simulation scenarios, 
and we subsequently explain this favorable performance through a theoretical examination of 
their consistency properties. When certain regularity conditions apply, we demonstrate that 
the nonlocal procedures arc consistent for linear - models even when the number of covariates 
p increases sub-exponentially with the sample size n. A model selection procedure based on 
Zellner’s //-prior is also found to be competitive with penalized likelihood methods in identi¬ 
fying the true model, but the posterior distribution on the model space induced by this method 
is much more dispersed than the posterior distribution induced on the model space by the 
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nonlocal prior methods. We investigate the asymptotic form of the marginal likelihood based 
on the nonlocal priors and show that it attains a unique term that cannot be derived from the 
other Bayesian model selection procedures. We also propose a scalable and efficient algorithm 
called Simplified Shotgun Stochastic Search with Screening (S5) to explore the enormous 
model space, and we show that S5 dramatically reduces the computing time without losing the 
capacity to search the interesting region in the model space, at least in the simulation settings 
considered. The S5 algorithm is available in an R package BayesS5 on CRAN. 

Key words: Bayesian variable selection; Nonlocal prior; Precision-recall cur\>e; Strong model 
consistency; Ultrahigh-dimensional data. 


1 Introduction 


In the context of hypothesis testing, Johnson and Rossell (2010) defined nonlocal (alternative) pri¬ 
ors as densities that are exactly zero whenever a model parameter equals its null value. Nonlocal 
priors were extended to model selection problems in Johnson and Rossell (2012), where product 
moment (pMoM) prior and product inverse moment (piMoM) prior densities were introduced as 
priors on a vector of regression coefficients. In p < n settings, model selection procedures based 
on these priors were demonstrated to have a strong model selection property: the posterior proba¬ 


bility of the true model converges to 1 as the sample size n increases. More recently, Rossell et al. 


( |2013 ) and Rossell and Telesca (2017) proposed product exponential moment (peMoM) prior den¬ 
sities that have similar behavior to piMoM densities near the origin. However, the behavior of 
nonlocal priors in p n settings remains understudied to date (particularly in comparison to other 
commonly used variables selection procedures), which serves as the motivation for this article. 

We undertook a detailed simulation study to compare the performance of nonlocal priors in p 
n settings under sparsity with a host of penalization methods including the least absolute shrinkage 


and selection operator (lasso; Tibshirani (1996)), smoothly clipped absolute deviation (scad; Fan 


and G|(|2001[)), adaptive lasso (]Zou][2006j), minimum convex penalty (mcp; Zhang (|2010|)), and the 
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reciprocal lasso (rlasso), recently been proposed by Song and Liang ( 2015 ). The penalty function 
of the rlasso is equivalent to the negative log-kernel of nonlocal prior densities; further connections 
are described in Section[5] As a natural Bayesian competitor, we also considered the widely used g- 
prior (Zellner[|1986[|Liang et al. 2008|), which is a local prior in the sense of Johnson and Rossell 


(2010). We used precision-recall curves (Davis and Goadrich 2006) as a basis for comparison 
between methods. These curves eliminate the effect of the choice of tuning parameters for each 
method so that the comparison across different methods can be transparent. It has been argued 


(Davis and Goadrich [2006 ) that in cases where only a tiny proportion of variables are significant, 
precision-recall curves are more appropriate tools for comparison than are the more widely used 
receiver operating characteristic curves. While the ROC curves present a trade-off between the 
type I error and the power of a decision procedure, precision-recall curves examine the trade-off 
between the power and the false discovery rate. 

Our studies indicate that Bayesian procedures based on nonlocal priors and the ry-prior perform 
better than penalized likelihood approaches in a sense that they achieve a lower false discovery 
rate, while maintaining the same power of the decision procedure. Posterior distributions on the 
model space based on nonlocal priors were found to be more tightly concentrated around the 
maximum a posteriori model than the posterior based on ry-priors, implying that they had a faster 
rate of posterior concentration. We also identified the oracle hyperparameter that maximizes the 
posterior probability of the true model for the Bayesian procedures. The growth-rate of these oracle 
hyperparameters with p also offers an interesting contrast between nonlocal and local priors. In 
the case of <y-priors, the oracle value of g varied between 7.83 x 10 8 and 4.29 x 10 13 as p ranged 
between 1000 and 20000. For the same range of p, the oracle value of r varied between 1.97 


and 3.60, where r is the tuning parameter for nonlocal priors described in Section 2. George and 


Foster (2000) argued from a minimax perspective that the g parameter should satisfy g 


P , 


which explains the large values of the optimal g. However, using asymptotic arguments to obtain 
default hyperparameters is difficult because the constant of proportionality is typically unknown. 
Moreover, when g is very large, the ry-prior assigns negligible prior mass at the origin, essentially 


3 
























resulting in a nonlocal like prior. A similar point can be made about the recently proposed Bayesian 


shrinking and diffusing (BASAD) priors (Narisetty and He 2014). On the other hand, the optimal 
hyperparameter value for the nonlocal priors is stable with increasing p, growing at a very slow 
rate. 

Motivated by this empirical finding, we studied properties of two classes of nonlocal priors 
allowing the hyperparameter r to scale with p. Using a fixed value of r, it seems that strong model 


selection consistency is possible only when p < n (Johnson and Rossell, 2012). In this article, we 
establish that nonlocal priors can achieve strong model selection consistency even when the number 
of variables p increases sub-exponentially in the sample size n, provided that the hyperparameter r 
is asymptotically larger than logp. This theoretical result is consistent with our empirical finding. 


2 Nonlocal prior densities for regression coefficients 

We consider the standard setup of a Gaussian linear regression model with a univariate response 
and p candidate predictors. Let y = (pi,..., y„) T denote a vector of responses for n individuals 
and X an n x p matrix of covariates. We denote a model by k = {k\, ..., At| k |}» with 1 < k\ < 

... < fcik| < P- Given a model k, let X k denote the design matrix formed from the columns of 
X n corresponding to model k and u k = (/3 k} i,..., /3 k , |k|) T the regression coefficient for model k. 
Under each model k, the linear regression model for the data is 


y = X k /3 k + e, 


( 1 ) 


where e ~ iV n (0, cr 2 I n ). Let t denote the true, or data-generating model and let /3° be the true 
regression coefficient under model t. We assume that the true model is fixed but unknown. 


Given a model k, the product exponential moment (peMoM) prior density (Rossell et al. 2013 
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Figure 1: Nonlocal prior density functions for a single regression coefficient with r = 5; for the 
piMoM prior, r — 1. 


Rossell and Telesca, 2017) for the vector of regression coefficients /3 k is defined as 


71 Wk I c 2 ,t, k) = C |k| JJexp{-^ j /(2(T 2 r) - r/^y}- 

3 =1 


( 2 ) 


The normalizing constant C can be explicitly calculated as 


/ OO 

exp{—f 2 /(2cr 2 r) - r/f 2 }dt = (2vra 2 r) 1/2 exp{-(2/a 2 ) 1/2 }, 

-OO 


(3) 


since / exp{—/i/f 2 — (t 2 }dt = (n/C) 1 ^ 2 exp{—2( J u() 1 / 2 }. 

Second, for a fixed positive integer r, the product inverse-moment (piMoM) prior density 
(Johnson and Rossellj 2012) for /3 k is given by 


vr(/3 k I er 2 , r, k) = C* |k| 


[(/3 kJ ) r exp{—r//3 kj }], 


(4) 


j=i 


where C* = r r+ 1 / 2 r(r — 1/2) for r > 1/2, where T(-) is the gamma function. 

The piMoM and peMoM prior densities are nonlocal in the sense that the density value at the 
origin is exactly zero. This feature of the densities for a single regression coefficient is illustrated 
in Figure [I] Since the piMoM prior densities and the peMoM prior densities have the same term 
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exp{—r/ 3 2 } that controls the behavior of the density function around the origin, they attain almost 
the same shape of the density function at the origin, which yields the same theoretical results in an 
asymptotic sense. Further details regarding this point are discussed in Section [4] 

We focus on these two classes of nonlocal priors in the sequel. Note that in both ([2]) and Q, 
7r(/3k) = 0 when /3 k = 0; a defining feature of nonlocal priors. The distinction between the peMoM 
and the piMoM priors mainly involves their tail behavior. Whereas peMoM priors possess Gaus¬ 
sian tails, the piMoM prior densities have inverse polynomial tails. For example, piMoM densities 
with r = 1 have Cauchy-like tails, which has implications for their finite sample consistency and 
asymptotic bias in posterior mean estimates of regression coefficients. Since similar conditions are 
later imposed on the hyperparameter r appearing in ([2]) and ([4]), at the risk of some ambiguity we 
use the same notation for the two hyperparameters in these equations. 

In addition to imposing priors on the regression parameters given a model, we need to place a 
prior on the space of models to complete the prior specification. We consider a uniform prior on 
the model space restricted to models having size less than or equal to q n , with q n < n, i.e., 


?r(k) oc /(|k| < q n ), 


(5) 


where /(■) denotes the indicator function and with a slight abuse of notation, we denote the prior 
on the space of models by n as well. Similar priors have been considered in the literature by 


Jiang ( j2007j ) and Liang et al.|( ]2013 ). Since the peMoM and piMoM priors already induce a strong 


penalty on the size of the model space (see Section [4]), we do not need to additionally penalize 


larger models using, for example, model space priors of the type discussed in Scott and Berger 


( |20T0l ). 

Under a peMoM prior ([2j) on the regression coefficients, the marginal likelihood m k (y) under 
model k given a 2 can be obtained by integrating out /3 k , resulting in 

m k (y) = (27TCT 2 )"f C'“ |k| Q k exp{-.R k /(2cx 2 )}, 
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where 


Rk = y T (ln - Pk)j/, P k = X k (A^X k + l/rI k )- 1 ^, 

r ~ _ |k| 

Qk = / exp{-(/3 k -/3 k ) T S k 1 (/3 k -/3 k )/(2a 2 ) - ^r//3 kj }ci/3 k , (6) 

J 3 =i 

5k = (X"X k + l/rlk)- 1 ^, S k = (X^A k + l/rlk)- 1 . 


Similarly, the marginal likelihood using the piMoM prior densities Q can be expressed as 

m k (y) = (27ra 2 ) _ t C*~ l k l Q k exp{— R^/(2a 2 )}, where 


K 

Qt 

Pk 


y T (l n -P k )y , P k = X k (A" k A" k ) _1 X k , 


|k| 


- A) T Sr‘(/ik - &)/(2 <t 2 ) - 

^ i=i j = l 

(A-Xk)- 1 ^^, S* = (A-Xk)- 1 . 


(7) 


The integrals for Q k and Q k cannot be obtained in closed forms, so for computational purposes 
we make Laplace approximations to m k (y). The expressions for the marginal likelihood derived 
here is nevertheless important for our theoretical study in Section |4} 


3 Numerical results 


3.1 Simulation studies using precision-recall curves 


To illustrate the performance of nonlocal priors in ultrahigh-dimensional settings and to compare 


their performance with other methods, we calculated precision-recall curves (Davis and Goadrich 


2006) for all selection procedures. A precision-recall curve plots the precision = TP/(TP + FP), 


versus recall (or sensitivity) = TP/(TP + FN), where TP, FP and FN respectively denote the number 
of true positives, false positives, and false negatives, as the tuning parameter is varied. The efficacy 
of a procedure can be measured by the area under the precision-recall curve; the greater the area, 
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the more accurate the method. Since both precision and recall take values in [0,1], the area under 
the curve for an ideal precision-recall curve is 1. We used two (n, p) combinations, namely (n, p) = 
(400,10000) and (n,p) = (400,20000), and plotted the average of the precision-recall curves 
obtained from 100 independent replicates of each procedure. 

We compared the performance of peMoM and piMoM priors to a number of frequentist penal¬ 
ized likelihood methods: lasso ([Tibshirani, 1996), adaptive lasso ( |Zou[ |2006| ), scad (Fa n and Li] 
20011, and minimax concave penalty (Zhang} 2010). We used the R package ncvreg to fit these pe¬ 


nalized likelihood methods. We also included reciprocal lasso in our simulation studies. However, 
due to computational constraints involved in implementing the full rlasso procedure, we followed 
the recommendation in Song and Liang (2015) and instead implemented the reduced rlasso. The 
reduced rlasso procedure is a simplified version of rlasso that uses the least square estimators of (3 
when minimizing the rlasso objective function. 

We considered Zellner’s g-prior (Zellner, |1986[ Liang et al. 2008) as a competing Bayesian 
method, with /3 k | k, a 2 ~ iV(0, (ja 2 (A^AV) ~~ 1 ) and g is the tuning parameter. With the prior 
7r(cr 2 ) oc 1/cr 2 , the marginal likelihood m k (g) oc (1 + ^r) — I k l/ 2 {1 + g(l — _D£)} - ( n- i)/ 2 can be 
obtained in a closed form; see for example, Liang et al. (2008, pp 412), where D\ is the ordinary 
coefficient of determination for the model k. 

A uniform model prior ([5]) was considered for all Bayesian procedures. This prior was chosen 
for several reasons. First, construction of the PR curves requires maximization over model hyper¬ 
parameters, which is most easily achieved if there is only one unknown hyperparameter. We also 
wished to avoid providing an advantage to the Bayesian methods by introducing additional tuning 
parameters into these methods that were not present in the penalized likelihood methods. Further¬ 
more, the use of non-uniform priors on the model space introduces (at least) one more degree of 
freedom into the comparisons between methods, and our intent was to compare the effects of the 
penalties imposed on regression coefficients by both penalized likelihood and Bayesian methods. 
At first blush, this might appear to put Bayesian methods like those based on the g-prior at a dis¬ 
advantage, since such methods do not yield consistent variable selection even in p < n settings 
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without prior sparsity penalties on the model space (when g is held fixed as n increases). However, 
in the construction of our PR curves, we allowed prior hyperparameters to increase with n, which 
effectively allowed the Bayesian methods to impose additional sparseness penalties through the 
introduction of large hyperparameter values. 

We arbitrarily fixed r = 1 for the piMoM prior (|4]) and used an inverse-gamma prior on a 2 
with parameters (0.1, 0.1) for the peMoM, piMoM priors, and g-priors. Posterior computations 
for the peMoM, piMoM and g-priors were implemented using the Simplified Shotgun Stochastic 
Search with Screening (S5) algorithm described in Section [7] The maximum a posteriori model 
was used in each case to summarize the model selection performance. The precision-recall curves 
are drawn by varying the hyperparameters (r for the nonlocal priors and g for the g-priors), so the 
comparison between the model selection based on the nonlocal priors and the g-prior is free of 
the choice of hyperparameters. Because of their high computational burden, we could not include 
BASAD (Narisetty and He] 2014) in the comparisons. 

For each simulation setting, we simulated data according to a Gaussian linear model as in (jTj) 
with the fixed true model t = (1, 2, 3,4,5} with the true regression coefficient /3° = (0.50, 0.75,1.00,1.25,1.50) 1 
and a = 1.5. Also, the signs of the true regression coefficients were randomly determined with 
probability one-half. Each row of X was independently generated from a iV(0, £) distribution with 
one of the following covariance structures: 


Case (1): compound symmetry design; E^y = 0.5, if j ^ j' and Ey = 1, 1 < j,j' < p. 

Case (2): autoregressive correlated design; Ey/ = 1 < j,j' < p. 

Case (3): isotropic design; E = l p . 

Figure |2]plots the precision-recall curves averaged over 100 simulation replicates for the differ¬ 
ent methods across the two (n,p) pairs and the three covariate designs. From Figure [2] it is evident 
that the precision-recall curves for the peMoM and piMoM priors have an overall better perfor¬ 
mance than the penalized likelihood methods lasso, adaptive lasso, scad, and mcp. For decision 
procedures having the same power, this implies that the nonlocal priors achieve lower false dis¬ 
covery rates. As discussed in Section [5} since the reduced rlasso shares the same nonlocal kernel 
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as the nonlocal priors, it has a similar selection performance. The figure also shows that Zellner’s 
( 7 -prior attains comparable performance with the nonlocal priors in terms of the precision-recall 
curves. 


3.2 Further comparison with Zellner’s g-prior 


The similarity of the performances of the ( 7 -prior and the nonlocal priors in terms of precision- 
recall curves begs for closer comparisons of these procedures. For this reason, we also investigated 
the concentration of the posterior densities around their maximum models. To this end, we fixed 
p = 20, 000 and varied n from 150 to 400; the data generating mechanism was exactly the same 
as in Section |3Tj The left column of Figure [3] displays the posterior probability of the true model 
under the peMoM, piMoM and ( 7 -prior models versus n for the three covariate designs in Section 


3.1 The plot shows that the posterior probability of the true model increases with n for all three 
methods, with the peMoM and piMoM priors almost uniformly dominating the ( 7 -prior, implying 
a higher concentration of the posterior around the true model for the nonlocal priors. 

This tendency is confirmed in the right panel of Figure [3j where we plot the number of models 
k which achieve a posterior odds ratio 7r(k | y)/ 7r(k j y ) > 0.001, where k is the maximum 
a posteriori model. This plot clearly shows that the posterior distribution on the model space 
from the ( 7 -priors is more diffuse than those obtained using the nonlocal prior methods. These 
comparisons were based on fitting the hyperparameters g and r at their oracle value, i.e., the value 
which maximized the posterior probability of the true model for a given value of n. 

The magnitudes of the oracle hyperparameters under each model also present an interesting 
contrast between the local and nonlocal priors. We observed that the oracle value of g increased 
rapidly with p, whereas the oracle value of r was much more stable. This phenomenon is illustrated 
in Table [T| that shows the oracle hyperparameter value averaged over 100 replicates for the three 
different covariate designs in Section |3.1| For this comparison, we fixed n = 400 and varied p 
between 1000 and 20, 000; five representative values are displayed. The oracle values for g are 
on a completely different scale from the oracle values r, and they vary more with p. This table 
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Figure 2: Plot of the mean precision-precision curves over 100 datasets with (n,p) = 

(400,10000)(first column) and (n,p) = (400, 20000)(second column). Top: case (1); middle: 
case (2); bottom: case (3). 
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Figure 3: Averaged posterior true model probability and the number of models which attain the 
posterior odds ratio, with respect to the maximum a posteriori model, larger than 0.001 with the 
fixed p = 20000 and varying n. Top: case (1); middle: case (2); bottom: case (3). 
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Table 1: Optimal hyperparameters for Bayesian model selection methods 


The number of predictors 




p = 1000 

p = 2000 

p = 5000 

p = 10000 

p = 20000 

Case (1) 

peMoM 

piMoM 

g-prior 

2.24 

2.16 

7.83 x 10 8 

2.72 

2.59 

2.87 x 10 9 

2.88 

2.70 

3.05 x 10 9 

3.32 

3.04 

9.66 x 10 9 

3.60 

3.26 

1.70 x 10 10 

Case (2) 

peMoM 

piMoM 

g-prior 

1.97 

1.97 

8.56 x 10 9 

2.29 

2.20 

2.55 x 10 10 

2.34 

2.32 

2.62 x 10 10 

2.75 

2.66 

6.58 x 10 10 

3.00 

2.86 

1.25 x 10 11 

Case (3) 

peMoM 

piMoM 

g-prior 

2.66 

2.61 

1.26 x 10 12 

3.00 

2.94 

8.84 x 10 12 

3.00 

2.94 

9.67 x 10 12 

3.10 

2.94 

6.81 x 10 12 

3.60 

3.46 

4.29 x 10 13 


confirms the recommendations in George and Foster (2000) for setting g = p 2 based on minimax 
arguments. However, the finite sample behavior of the optimal choice of g is unclear, which means 
that the large variance of the optimal hyperparameter value is likely to hinder the selection of g in 
real applications. Finally, we note that such large values of g effectively convert the local g-priors 
into nonlocal priors by effectively collapsing the g-prior density to 0 at the origin. 


4 Model selection consistency 


The empirical performance of the peMoM and piMoM priors suggests that the hyperparameter r 


should be increased slowly with p. While Johnson and Rossell (2012) were able to show strong 
selection consistency with a fixed value of r, it is not clear whether their proof can be extended to 
p>n cases. Motivated by the empirical findings of the last section, we next investigated the strong 
consistency properties of peMoM and piMoM priors when r was allowed to grow at a logarithmic 
rate in p. We found that in such cases, both peMoM and piMoM priors achieve strong model 
selection consistency under standard regularity assumptions when p increases sub-exponentially 
with n, i.e., logp = 0(n a ) for a e (0,1). 

Henceforth, we use T n/p instead of r to denote the hyperparameter in the peMoM and piMoM 
priors in Q and (|4]) respectively. The normalizing constants for these priors is now denoted by C UiP 
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and C* respectively. Before providing our theoretical results, we first state a number of regularity 
conditions. Let Vj(A) denote the j-th largest nonzero eigenvalue of an arbitrary matrix A, and let 


?y k* = _ .min Vj(X£X k /n), ?4 = max Uj(X^X k /n). 


( 8 ) 


For sequences a n and b n , a n h b n indicates b n = 0(a n ), and a n >- b n indicates b n = o(a n ). 
With this notation, we assume that the following regularity conditions apply. 

Assumption 1. There exists a G (0,1) such that logp = 0(n a ). 

Assumption 2. log/; -< T n . p -< n. 

Assumption 3. |k| < q n , where q n -< 


Assumption 4. min >- ^ rh£ . 

k:|k|<g n 


Assumption 5. C\ < u u < /g* < ^ f° r some positive constants C\ and C 2 


Several comments regarding these conditions are worth making. Assumption 1 allows p to 
grow sub-exponentially with n. Our theoretical results continue to hold when p grows polynomi- 
ally in n, i.e., at the rate 0(n 7 ) for some 7 > 1. Assumption 2 reflects our empirical findings about 


the oracle r = T n/p in Section 3.1 which was observed to grow slowly with p. We need the bound 
on q n in Assumption 3 to ensure that the least square estimator of a model is consistent when a 
model contains the true model. In the p < n setting, Johnson and Rossell ( 2012| assumed that 
all eigenvalues of the Gram matrix )/n are bounded above and below by global constants 

for all k. However, this assumption is no longer viable when p»n and we replace that by As¬ 
sumption 4, where the minimum of the minimum eigenvalue of {X^X^/n over all submodels k 
with | k | < q n is allowed to decrease with increasing n and p. Assumption 4 is called the sparse 


Riesz condition and is also used in Chen and Chen (2008) and Kim et al. (2012). Narisetty and 


He (2014) showed that Assumption 4 holds with overwhelmingly large probability when the rows 
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of the design matrix are independent with an isotropic sub-Gaussian distribution. Even though the 


assumption of sub-Gaussian tails on the covariates is difficult to verify, the results in Narisetty and 


He (2014) show that Assumption 4 can be satisfied for some sequence of design matrices. 

We now state a Theorem that demonstrates that model selection procedures based on the pe- 
MoM and piMoM nonlocal prior densities achieve strong consistency under the proposed regular¬ 
ity conditions. A proof of the Theorem is provided in the Supplemental Materials. 


Theorem 1. Suppose a 2 is known and that Assumptions 1-5 hold. Let 7r(t | y) denote the 
posterior probability of the true model obtained under a peMoM prior ([2]). Also, assume a uniform 
prior on all models of size less than or equal to q n , i.e., 7r(k) oc /(|k| < q n ). Then, 7r(t | y) 
converges to one in probability as n goes to oo. 


Corollary 2. Assume the conditions of the preceding Theorem apply. Let 7r(t | y) denote the 
posterior probability of the true model obtained under a piMoM prior density ([4]). Then, 7r(t | y) 
converges to one in probability as n goes to oo. 


We note that these results apply also if a beta-Bemoulli prior is imposed on the model space 
as in 


Scott and Berger (2010), because the effect of that prior is asymptotically negligible when 


|k| <q n ~<n. 


In most applications, cr 2 is unknown, and it is thus necessary to specify a prior density on it. By 
imposing a proper inverse gamma prior density on cr 2 , we can obtain the strong model consistency 
result stated in the Theorem below. The proof is again deferred to the Supplemental Materials. 


Theorem 3. Suppose a 2 is unknown and a proper inverse gamma density with parameters (do, hf) 
is assumed for a 2 . Also, let 7r(t | y) denote the posterior probability of the true model evalu¬ 
ated using peMoM priors. Then if Assumptions 1-5 are satisfied, 7r(t | y) converges to one in 
probability as n goes to oo. 


Corollary 4. Suppose the conditions of the preceding Theorem apply, but that it (t | y) now denotes 
the posterior probability of the true model obtained under a piMoM prior density. Then 7r(t | y) 
converges to one in probability as n goes to oo. 
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5 Connections between nonlocal priors and reciprocal lasso 


In this section, we highlight the connection between the rlasso of [Song and Liang ( [2015] ) and 
Bayesian variable selection procedures based on our nonlocal priors. We begin by noting that the 
objective function g(/3 k; k) of rlasso on a model k can be expressed as follows: 


|k| 

g (Ad k) = II y - X k f3 k \\l + Y T n,p/\pKj 

3 = 1 


(9) 


The optimal model is selected by minimizing this objective function with respect to (3 k and k. 
It is clear that the penalty function r n, P /\Pk,j\ in ([9]) is similar to the negative log-density 


of piMoM nonlocal priors as proposed in Johnson and Rossell (j2012 pp 659) and Johnson and 


Rossell (20101 pp 149). The main difference between the nonlocal prior version of rlasso and the 


piMoM-type prior densities proposed in the previous section is the power of 13 in the exponential 
kernels. For the rlasso penalty this power is 1, while for piMoM-type prior densities it is 2. The 
implications of this difference are apparent from the following proposition. 


Proposition 5. For a given model k, suppose that /?£ is the minimizer of the objective function ([9]), 
and again let (3^ denote the least square estimator of 3 under model k. Assume that T n/p -< n, and 
there exist strictly positive contants Cl and Cjj such that Cl < ^k* < if < Cu- Then, for any 

< >- On,p/n) 1/3 , 


p 


PZ i R (pc <) 


0, 


where R{u ; e) = {x e : |Xj — Uj\ < e,j = 1,..., |k| }. 


The proposition shows that under standard conditions on the eigenvalues of the Gram matrix 
XfX k /n, the estimator derived from ([9]) is asymptotically within (r rt , p /n) '/ 3 distance of the least 
squares estimator /3^. On the other hand, results cited in the previous section show that maximum 
a posteriori estimators obtained from the piMoM-type prior densities reside at an asymptotic dis¬ 
tance of (rn^/n) 1 / 4 from the least squares estimator. Variable selection procedures based on both 
forms of piMoM priors thus achieve adaptive penalties on the regression coefficients in the sense 
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described in Song and Liang ( [2015 ). 

Although rlasso is proposed as a penalized likelihood approach, the computational procedure 
to optimize its objective function is quite different from the other penalized likelihood methods. 
The resulting computational complexity of this optimization procedure, which contains a discon¬ 
tinuous penalty function, is NP-hard. This suggests that the formulation of this nonlocal penalty 
in a penalized likelihood framework is unlikely to provide significant computational advantages 
over related Bayesian model selection procedures, even though the inferential advantages of the 
Bayesian framework are lost. 


6 Asymptotic behavior of marginal likelihoods based on non¬ 
local priors 

From Lemma [T] in the Supplemental Materials, it follows that the asymptotic log-marginal likeli¬ 
hood of a model k based on a peMoM or piMoM prior density can be expressed as 


log7r(k | y) = /(3k) + logQk - |k| logC'n^ 

M 

3 = 1 

for some constant C, /3k is the maximum likelihood estimator under the model k, i.e. /3k = 
(XlXk)- l Xly, and 


PT„,p{/3k,j ) ~ 


{ (nr n , p Uk) 1/2 , 


if Ail <c(^r 1,i 

if|AJ>e(g;)- I/4 , 


( 10 ) 


for some constant c and some arbitrary sequence Uk with < « k < u£. We note that the 
strength of the correlation between the variables in the model k affects the behavior of Uk, and 
(rmk/T^p) -1 / 4 converges to zero as n tends to infinity due to Assumption 4 described in Section 
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On the other hand, the penalty term in the other Bayesian model selection approaches is quite 
different from that of the nonlocal priors as in ( fTO] ). The marginal likelihood based on the yy-prior 
when a 2 is known can be expressed as 

10 k) - |k| log(l + <7)/2. 

Narisetty and He(2014) demonstrated that BAS AD achieves the strong model selection con¬ 
sistency. This consistency follows from that the fact that the BASAD “penalty” is asymptotically 
equivalent to 


I0 k ) - c|k| log(p), 


(U) 


where c is some constant. Yang et al. (2016) and Castillo et al. (2012) also considered a similar 
penalty term on the model space, which implies that the posterior probability for their procedures 
can be expressed in the same form as ( [IT] ). When g = p 2c , the marginal likelihood based on a 
g -prior is asymptotically equivalent to ( ]TT[ ). 

The asymptotic term of the marginal likelihoods is quite different from that of the nonlo¬ 
cal priors, since the penalty terms in the other Bayesian approaches only focus on the model 
size without considering the different weights on variables in the model. The marginal likeli¬ 
hoods based on nonlocal priors, however, impose different penalties on each predictor in the given 
model. When the MLE of the regression coefficient in the model is asymptotically close to zero 
(lAcjl < c(nw k /r n)P ) -1 / 4 ), the model that contains the corresponding variable would be strongly 
penalized by (nr„j,// k )'' /2 . In contrast, when the MLE is asymptotically significant (|/3 k j > 
c(nM k /r ri]P )" 1 / 4 ), the penalty attains a different weight based on the MLE (p Tn , p (Ac,j) ~ T n )P /Pij)- 

This analysis highlights the fact that the nonlocal priors are able to adapt their penalty for the 
inclusion of covariates based on the observed data, whereas the local priors must instead rely on a 
prior penalty on non-sparse models. 
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7 Computational strategy 


In p n settings, full posterior sampling using existing Markov chain Monte Carlo (MCMC) al¬ 
gorithms is highly inefficient and often not feasible from a practical perspective. To overcome this 
problem, we propose a scalable stochastic search algorithm aimed at rapidly identifying regions of 
high posterior probability and finding the maximum a posteriori (MAP) model. Our main innova¬ 


tion is to develop a stochastic search algorithm combining isis-like screening techniques (Fan and 


Lv 2008) and temperature control that is commonly used in global optimization algorithms such 


as simulated annealing (Kirkpatrick and Vecchi[ |1983| ). 

To describe our proposed algorithm, consider the MAP model k that can be expressed as 


k = argmax{7r(k | y)}, 
ker* 


( 12 ) 


where T* is the set of all models assigned non-zero prior probability. 


7.1 Shotgun stochastic search algorithm 


Hans et al. (2007) proposed the shotgun stochastic search (SSS) algorithm in 


ciently navigate through very large model spaces and identify global maxima. 
{T+, T- T 0 }, where T+ = {kU {j} : j G k c }, T~ = {k\{j} : j G k}, and T° 
l G k c , j G k}, the SSS procedure is described in Algorithm 1. 


an attempt to effi- 
Letting nbd(k) = 


Algorithm 1 Shotgun Stochastic Search (SSS) 


Choose an initial model k (1) 

For i — 1 to i — N — 1 

Compute 7r(k | y) for all k G nbd(k ( ^) 

Sample k + , k , and k°, from T + , T - , and F°, with probabilities proportional to 7r(k | y) 
Sample k^ +1 ^ from {k + , k . k 0 }, with probability proportional to 

{vr(k+ | y), 7r(k" | y), 7r(k° | y)} 


The MAP model can be identified by the model that achieves the largest (unnormalized) pos- 
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terior probability among those models searched by SSS. 


7.2 Simplified shotgun stochastic search algorithm with screening (S5) 


SSS is effective in exploring regions of high posterior model probability, but its computational cost 
is still expensive because it requires the evaluation of marginal probabilities for models in T h , 
and r° at each iteration. The largest computational burden occurs for the evaluation of marginal 
likelihood for models in F°, since |r°| = k |(p — |k|). To improve the computational efficiency 
of SSS, we propose a modified version which only examines models in T + and Y \ which have 
cardinality p — |k| and |k|, respectively. However, by ignoring T 0 in the sampling updates we 
make the algorithm less likely to explore “interesting” regions of high posterior model probability, 
and therefore more likely to get stuck in local maxima. To counter this problem, we introduce a 
“temperature parameter” analogous to simulated annealing which allows our algorithm to explore 
a broader spectrum of models. 

Even though ignoring models in T° reduces the computational burden of the SSS algorithm, the 
calculation of p posterior model probabilities in every iteration is still computationally prohibitive 
when p is very large. To further reduce the computational burden, we borrow ideas from the 


Iterative Sure Independence Screening (isis; Fan and Lv (2008)) and consider only those variables 
which have a large correlation with the residuals of the current model. More precisely, we examine 
the products \r^Xj\, where r k is the residual of the model k, for j = 1.... . p, after every iteration 
of the modified shotgun stochastic search algorithm, and then restrict attention to variables for 
which {| r^Xj\ : j — 1,..., p} is large (we assume that the columns of X have been standardized). 
This yields a scalable algorithm even when the number of variables p is large. 

With these ingredients, we propose a new stochastic model search algorithm called Simplified 
Shotgun Stochastic Search with Screening (S5), which is described in Algorithm 2. 

In S5, S k is the union of variables in k and the top M n variables, obtained by screening using 
the residuals from the model k. The screened neighborhood of the model k can be defined as 
nbd scr (k) = {r+ r ,r - }, where T+ r = {kU {j} :je k c D S k }. 
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Algorithm 2 Simplified Shotgun Stochastic Search with Screening (S5) 


Set a temperature schedule ti > t 2 > ■ ■ ■ > tL > 0 

Choose an initial model k (11) and a set of variables after screening S k (i,i) based on k^ 1,1 ) 

For l = 1 in l — L 

For i in 1,..., J — 1 

Compute all 7r(k | y) for all k e nbd scr (k^’^) 

Sample k + and k , from F+ r and T~, with probabilities proportional to 7r(k | y) 1 ^' 

Sample k^ +1 ’^ from {k + , k~}, with probability proportional to {7r(k + | y) 1 ^ 1 , 7r(k~ | y) 1 ^ 1 } 
Update the set of considered variables S k (*+i,o to be the union of variables in k^ +1,/) and 
the top M n variables according to {[r^ (i+l l) Xj\ : j = 1,... ,p} 


Even though this algorithm is designed to identify the MAP model, it also provides an ap¬ 
proximation to the posterior model probability of each model. The uncertainty of the model space 
can be measured by approximating the normalizing constant from the (unnormalized) posterior 
probabilities of the models explored by the algorithm. 

Denoting the computational complexity of the evaluation of the unnormalized posterior model 
probability of the largest model among searched models by E n , the computational complexity 
of the SSS algorithm can be expressed as the product of the number of explored models by the 
algorithm and E n , which is [0{Np}+0{Nq n }+0{N(p—q n )q n }] x E n , where q n is the maximum 
size of model among searched models and q n < n <C p. 

On the other hand, the S5 only considers M n variables after the screening step in each iteration, 
which dramatically reduces the number of models to be considered in constructing the neighbor¬ 
hood, 0{JL(M n — q„)} + 0(JLM n ). Therefore, the resulting computational complexity is 

[0{JL{M n - q n )} + 0{JLM n )] x E n + O(JLnp), 

where q n < M n . When the computational complexity for screening steps, O(JLnp), is dominated 
by the other terms, the computational complexity is almost independent of p. As a result, the 
proposed algorithm is scalable in the sense that the resulting computational complexity is typically 
robust to the size of p. 


21 



7.3 Performance comparisons between S5 and SSS 

We examined the computational performance of S5 to SSS in identifying the MAP model under 
a piMOM prior with r n;P = log n log p and r — 1. We generated data according to Case (1) in 
Section [3] with a fixed sample size (n = 200), and a varying number of covariates p. We set 
M n = 20, L = 20, and J = 20 for S5. To match the total number of iterations between S5 and 
SSS, we set N = 400 for SSS. All computations were implemented in R. 

Figure [4] shows the average computation time and the number of models searched before hitting 
the MAP model for the first time for the S5 and SSS algorithms. All averages were based on 100 
simulated datasets, and both algorithms obtained the same MAP model for all data sets. Panel (a) 
shows that the computation time of SSS increases roughly at a p 2 rate, but that the computation time 
for S5 was nearly independent of the number of covariates p (about 4 seconds). For example when 
p = 2, 000, SSS first found the MAP model in an average of 1,360 seconds (about 23 minutes), 
whereas S5 hit the MAP model after about only 4 seconds. Interestingly, panel (b) of Figure [4] 
also illustrates that the S5 algorithms explored only 181 models on average to hit the MAP model, 
whereas SSS typically visited slightly more than 38,000 models. Thus, not only is S5 much faster 
than SSS in identifying the MAP model, but it also visited far fewer models before visiting the 
MAP model. 


8 Real data analysis 


8.1 Analysis of polymerase chain reaction (PCR) data 


Lan et al. ([2006) studied coordinated regulation of gene expression levels on 31 female and 29 
male mice (n = 60). A number of psychological phenotypes, including numbers of stearoyl-CoA 
desaturase 1 (SCD1), glycerol-3-phosphate acyltransferase (GPAT) and phos- phoenopyruvate car- 
boxykinase (PEPCK), were measured by quantitative real-time RT-PCR, along with 22,575 gene 
expression values. The resulting data set is publicly available at http : / /www. ncbi . nlm. 
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Figure 4: (a) the average computation time to first hit the MAP model; (b) the average number of 
models searched before hitting the MAP model. The left y -axis is in a logarithmic scale and the 
right y-axis is in the raw scale. 


nih . gov/geo (accession number GSE3330). 


Zhang et al. (2009) used penalized orthogonal components regression to predict the three phe¬ 


notypes mentioned above based on the high-dimensional gene expression data. Bondell and Reich 


(2012) also used the same data set to examine their model selection procedure based on penaliz¬ 
ing regression coefficients within a (marginal or joint) credible interval obtained from a ridge-type 
prior. For brevity, we restrict attention here to SCD1 as the response variable. 

Since the ground truth regarding the true significant variables is not known for this data, we 
compared our approach with a host of competitors on predictive accuracy and parsimony of the 
selected model. 

Prior to analyses, we standardized the covariates and randomly split the data set into 5 test sam¬ 
ples and 55 training samples to evaluate the out-of-sample mean square prediction error (MSPE) 


MSPE = ( yi -X^l) 2 /\T test \, 

where T test is the index set of the test samples and /3~ r is the least square estimator under the 
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estimated model k based on the training samples. To avoid sensitivity to a particular split, we 
considered 100 replications of the training and test sample generation. To measure the stability of 
model selection, we considered the number of variables that were (i) selected at least 95 times, and 
(ii) at least once, out of the 100 replicates. 


Due to the high-computational burden of the penalized credible interval (Bondell and Reich 


2012) approach, we followed the pre-processing step suggested in their article to marginally screen 


variables to reduce to 2000 variables (1999 genes and gender). For all the other approaches, all 
22,575 genes were used. For the nonlocal priors, we considered both the MAP estimator and the 
least squares (LS) estimator from the MAP model. For the g- prior, we set g = jr as recommended 


in George and Foster (2000). For the penalized likelihood procedures, we used ten-fold cross 
validation to choose the tuning parameter. 

To choose the hyperparameter t u/[) for the nonlocal priors, we used a procedure proposed by 
Nikooienejad et al.]( 2016). That procedure sets the hyperparameter so that the L\ distance between 
the posterior distribution on the regression parameters under the null distribution (i.e., [5 = 0) 
and the nonlocal prior distributions on these parameters is constrained to be less than a specified 
value (e.g., p _1//2 ). The average value of the hyperparameter values chosen by this procedure were 
T n)V = 1.12 and r n/p = 1.16 for piMoM and peMoM priors, respectively. 

To make the comparison between the nonlocal priors and the g -prior more transparent, we used 
the same beta-binomial prior on the model space in both models, rather than the uniform prior on 
the model space described previously. The form of the beta-binomial prior was given by 


vr(k) oc p |k| (l - p) p |k| /(|k| < q n ), 


(13) 


with a uniform prior on p and q n = 40. We note that this prior does not strongly induce sparsity as 
does, for example, the prior obtained by imposing a Beta( 1, p u ), u > 1 prior on p, as suggested in 
Castillo et al. (2015 ). 
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Method 

MSPE 

MS 

FS 

TS 

piMoM(MAP) 

0.283 (0.17) 

1.00(0.00) 

1 

1 

piMoM(LS) 

0.282 (0.17) 

1.00(0.00) 

1 

1 

peMoM(MAP) 

0.291 (0.18) 

1.02 (0.14) 

1 

2 

peMoM(LS) 

0.287 (0.17) 

1.02 (0.14) 

1 

2 

g-prior 

0.368 (0.20) 

4.07 (0.56) 

1 

133 

lasso 

0.542 (0.39) 

17.97 (8.62) 

1 

211 

scad 

0.308 (0.23) 

12.66 (7.62) 

2 

163 

mcp 

0.308 (0.21) 

2.20 (0.94) 

0 

29 

Marginal(p = 2000) 

0.456 (0.40) 

17.47 (11.16) 

0 

273 

Joint(p = 2000) 

0.440 (0.40) 

16.42(11.06) 

1 

185 


Table 2: Analysis of the PCR data. Marginal and Joint refer to the variable selection procedures 
(Bondell and Reich| 2012) based on Bayesian marginal credible set and Bayesian joint credible 
set, respectively. MS is the average size of the selected model. FS is the number of frequently 
selected variables, i.e., that were selected at least 95 times in 100 repetitions. TS refers to the total 
number of variables selected at least once from 100 repetitions. Standard errors are provided in 
parenthesis. 


Table [2] summarizes the results from the analysis of the gene expression data set. On average, 
the nonlocal priors simultaneously produced the lowest MSPE and the most parsimonious model. 
The other model selection methods selected a wide array of different variables for different splits 
of the data set. In particular, lasso and the penalized credible region approach selected more than 
180 different variables from 100 repeated splits, while the average size of the selected model was 
less than 20 and the number of frequently selected variables was only zero or one, indicating a 
potentially large number of false positives picked up by these methods. 


8.2 A simulation study based on the Boston housing data 


We next examined the Boston housing data set that contains the median value of owner-occupied 
homes in the Boston area, together with several variables that might be associated with their median 
value. There were n = 506 median values in the data set, and we considered 10 continuous 
variables as the predictor variables: crim, indus, nox, rm, age, dis, tax, ptratio, b, and 
1st at. This data set has been used to validate a variety of approaches; some recent examples 
relevant to variable selection include Radchenko et al.| ( [201 l[ ), [Yuan and Lm| ( |2005[ ), and |Rockova 
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and George (2014). 


To examine the model selection performance in high-dimensional settings, we added 1,000 
noise variables that were generated independently from a standard Gaussian distribution (p = 
1, 010). The same competitors from the previous subsection were used with the aforementioned 
choice of hyperparameters. For nonlocal priors, the hyperparameter was chosen by the aforemen¬ 
tioned procedure (Nikooienejad et al., 2016) : the average of the chosen hyperparameter values 
were p = 2.01 and T n , p = 0.47 for piMoM and peMoM priors, respectively. Prior to analyses, 
we standardized the covariates and considered a simulation test size of 100 samples. 


Methods 

MSPE 

MS-0 

MS-N 

FS-0 

TS-0 

piMoM(MAP) 

24.281 (9.01) 

5.05 (0.22) 

0.01 (0.10) 

5 

6 

piMoM(LS) 

24.265 (9.04) 

5.05 (0.22) 

0.01 (0.10) 

5 

6 

peMoM(MAP) 

24.156 (9.02) 

5.02 (0.14) 

0.00 (0.00) 

5 

6 

peMoM(LS) 

24.165 (9.00) 

5.02 (0.14) 

0.00 (0.00) 

5 

6 

g-prior 

26.314(9.87) 

3.10(0.44) 

0.00 (0.00) 

3 

5 

lasso 

30.243 (11.82) 

5.07 (0.87) 

7.77(11.16) 

4 

8 

scad 

33.993 (10.66) 

5.39 (0.57) 

31.60 (28.28) 

5 

7 

mcp 

26.191 (9.87) 

4.66 (0.74) 

0.54(1.04) 

3 

6 

Marginal 

26.612(10.16) 

3.74 (0.88) 

0.41 (0.72) 

3 

7 

Joint 

26.385 (10.25) 

3.77 (0.94) 

0.02 (0.20) 

3 

6 


Table 3: The Boston Housing data set: MS-0 and MS-N refer to the average number of selected 
original variables and selected noise variables, respectively. FS-0 is the number of original vari¬ 
ables that are frequently selected at least 95 times out of 100 repetitions. TS-0 refers to the number 
of original variables selected at least once from 100 repetitions. 


The results of are analysis are summarized in Table [3] The conclusions are similar to those 
reported in Section 8.1; the nonlocal priors consistently choose more parsimonious models and 
had better predictive performance. The model selection procedure resulting from the nonlocal prior 
selects almost the same variables across the 100 repetitions. The average number of the original 
variables selected more than 95 times over 100 repetitions is 5, which is close to the average model 
size. It is also reliable in the sense that the average number of the original variables that are selected 
at least once across the repetitions is only 6. This means that model selection based on the nonlocal 
prior selects the same model in most data splits. On the other hand, penalized likelihood methods 
such as lasso and scad tend to select a large number of noise variables. 
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9 Conclusion 


This article describes theoretical properties of peMoM and piMoM priors for variable selection 
in ultrahigh-dimensional linear model settings. In terms of identifying a “true” model, selection 


procedures based on peMoM priors are asymptotically equivalent to piMoM priors in Johnson and 


Rossell (2012) because they share the same kernel, exp{— r np /f5 2 }. We demonstrated that model 


selection procedures based on peMoM priors and piMoM priors achieve strong model selection 
consistency in p^> n settings. 


In Section 3.1 precision-recall curves were used to show that the model selection procedure 
based on a g-prior can achieve nearly the same performance in identifying the MAP model as 
nonlocal priors when an optimal value for the hyperparameter g is chosen. However, as shown 


in Section 3.2 the value of the hyperparameter that maximizes the posterior probability of the 
true model is very large and has high variability, which may limit the practical application of 
this method. To overcome this problem, one can consider mixtures of g-prior as in Liang et al 


(2008), but the asymptotic behavior of Bayes factor and model selection consistency in ultrahigh- 
dimensional settings have not been examined for hyper-g priors, and they are difficult to implement 
computationally. 

In Section [7J we proposed an efficient and scalable model selection algorithm called S5. By 
incorporating the SSS with a screening idea and a temperature control, S5 was able to accelerate 
the computation speed without losing the capacity to explore the interesting region in the model 
space. Under some simulation settings, it outperformed the SSS in a sense that not only did S5 
search the MAP model much faster than the SSS, but it also found exactly the same MAP model 
that was searched by the SSS. 

Because the explicit form of the marginal likelihood of the nonlocal priors is not available, 


we used the Laplace approximation throughout the paper, and Barber et al. (2016) studied the 
accuracy of the approximation in Bayesian high-dimensional variable selection, especially when 
the dimension of the approximation (which is q n ) and n are both increasing. However, their results 
do not apply to the case of the nonlocal priors, since the nonlocal priors violate their regularity 
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condition (nonzero density at the origin). While empirical results in this paper and Johnson and 


Rossell (2012) suggest that the use of the Laplace approximation is reasonable, in future work 


it is still worth paying attention to the approximation error of the Laplace approximation to the 
marginal likelihood of the nonlocal priors. 

The close connection between our methods and the reduced rlasso procedures provides a use¬ 
ful contrast between Bayesian and penalized l ik elihood methods for variable selection procedures. 
According to the evaluation criteria proposed in Section [5j the two classes of methods appear to 
perform quite similarly. A potential advantage of the reduced rlasso procedure, and to the lesser 
extent the rlasso procedure, is reduced computation cost. This advantage accrues primarily because 
the reduced rlasso can be computed from the least squares estimate of each model’s regression pa¬ 
rameter, whereas the Bayesian procedures require numerical optimization to obtain the maximum 
a posteriori estimate used in the evaluation of the Laplace approximation to the marginal density of 
each model visited. However, the procedures used to search the model space, given the value of a 
marginal density or objective function, are approximately equally complex for both classes of pro¬ 
cedures. There are also potential advantages of the Bayesian methods. For example, it is possible 
to approximate the normalizing constant of the posterior model probability from the models visited 
by S5 algorithm, and to use this normalizing constant to obtain an approximation to the posterior 
probability assigned to each model. In so doing, the Bayesian procedures provide a natural esti¬ 
mate of uncertainty associated with model selection. These posterior model probabilities can also 
be used in Bayesian modeling averaging procedures, which have been demonstrated to improve 


prediction accuracy (e.g., [Raftery et al.;(1997'j )) over prediction procedures based on maximum a 
posteriori estimates. Finally, the availability of prior densities may prove useful in setting model 
hyperparameters (i.e., T n , p ) in actual applications, where scientific knowledge is typically available 
to guide the definition of the magnitude of substantively important regression parameters. 

We also developed an R package BayesS5 that provides all computational functions used in 
this paper, including a support of parallel computing environments. It is available on the author’s 
website and on CRAN. 
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10 Supplemental Materials 


The supplemental material contains proofs of the technical results stated in the paper and the 

Laplace approximations to evaluate the marginal likelihoods based on the nonlocal priors. 
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Supplemental Materials to “Scalable Bayesian Variable 
Selection Using Nonlocal Prior Densities in 
Ultrahigh-dimensional Settings” 


1 Preliminary Results 

Lemma 1. For Q k defined in ([6j), n*=i Qkj - Qk < Ely-i Qkj> 
where 


Qkj = c 1 ( lj2 ) 1/2 ( nly k + 1/ T n,p) 1/2 exp {-T n , p /fikj}, 

Qkj = c 2 (a 2 ) 1/2 (nv k * + l/Tn,p)~ 1/2 exp{-T n} p/(\j3k,j\ + £ n ) 2 }i 


and e n x (nv k */ T ntP ) 1 / 4 , with /3 k ■ G [fikj — e n , fikj + e n ] \ (—e n , e n ) c for some positive constants 
C\ and c 2 . 

Proof. Recall E k = (X^X k + l/r nyP \k)~ l - From ([8]), all eigenvalues of (£k) -1 are bounded 
between nv k * + l/r niP and nv £ + 1 /r ntP , which implies for all x e {rw k * + ^/T nyP )x T x — 
x T (Y, k y 1 x < (nv£ + l/r nyP )x T x. Let T ln = {(nu£ + l/r n , p )/cr 2 } 1/2 and T 2n = {{nv k * + 
1 / T n, P ) / ■ Substituting the above inequality in the expression for Q k , we have 


where 


9l(Pk,j) < Qk < 


92(fik,j), 


i= 1 


3 = 1 


(SI) 


9i{P kj) = / exp{-2^(/3 k j - Ac,i)7 2 - T'n.p/^kjl^kj, 


(S2) 


for i = 1,2. We establish the lower bound first by showing that gffikj) > Q k ■ for all j = 


1 






1,..., |k|. Recall e n x (n// k */T np ) 1//4 from the statement of the Lemma. We have 


0i(Acj) > / __ exp{—— ^kj) 2 / 2 — T n,p/Pkj}dPk,j 

j [flk,j-en,l3k,j+en]\(-en,en) & 

> exp {-Tn'p/Pgj} [_ __ __ expj-T^^kj - 3kj) 2 /2}o?^ kJ , 

^ [Ac,.? - en 5 Ac,.?+ e n]\( — e n ,6 n ) c 

for some /?£ ■ e [/3 kj - — e n , /3 kj + e n ] \ (—e n ,e n ) c . Then, the integral in the last line of the above 
display is equivalent to 



> £ n]\(-Ac ,j 




e~ T '^ /2 dt > ClTr 1 




/2 d^ > c 2 T ln \ 


where ci and C 2 are some positive constants and the last inequality in the above display follows 
since T Xn e n > 1 for large n. Substituting back in the previous display, <g\ (/3 k j) > c{T{2 exp{— r„„ p //3 k 2 j} 
for some constant ci > 0, completing the proof of the lower bound. 

We now establish the upper bound by showing that .g 2 (/5 k j) < Q k / for all j = 1,..., |k|. It 
is straightforward to see that g 2 is a symmetric function (i.e, .9 2 (/3 kj ) = .92(|/3 kj l))> so that it is 
enough to establish the bound for /3 kj - > 0; without loss of generality we assume that /3 kj > 0. 

We have 


+ 

+ 


' —OO 

/>0 


exp{-T 2 n (/3 kj - /3 kJ ) 2 /2 - T niP /3 kj }d^ kj - 
exp{-T 2 n (/3 kj - - 3kj) 2 /2 - r njP /(3kj}df3kj 


f/3 kj+^n 


exp{-T 2 2 n (^ kJ - /3 kJ ) 2 /2 - T n>p //3^-}d^ k j 


Ac,? 


exp{—T 2 n (/3 k j - /? kj ) 2 /2 - r niP //3 kJ }d/3 kJ . 


Define the first term of the above as Wi, the second as bF 2 > an d the third term as W 3 . First, we 
shall show that W\ < cT.^ exp{—T 2n (2r n)P ) 1 / 2 } for some positive constant c. By transforming 


2 



the variable t = /3 kj — /3k j. 


= / exp{-r^/2 + r 2 ^ - T^/2 - r n , p /* 2 }<ft 


r —oo 
/■O 


< 


exp{-T 2 n r/2 - T n , p /t 2 }dt 


< c 3 r 2n 1 exp{-r 2n (2r n , p ) 1/2 }, 

for some constant c 3 , since / exp{—/i/t 2 — £f 2 }<if = (7r/C) —1//2 exp{—2(/x£) 1//2 } for yU > 0 and 

C > o. 

Second, by changing the variable z = t — e. 


W 2 = 


r 0kj 


exp {-T 2 n (z - /3 k ,j + e n f/2 - r, hP /(z + e n ) 2 }dz 


< exp{-r„ iP / (/3 k ,j + e n ) 2 } / exp{-T 2 2 n (z - /3 kj + e„) 2 /2} 

J —OO 

< c 4 T 2 " 1 exp{—r niP /(3k j + e n ) 2 }, 


for some positive constant c 4 . 

Third, by changing the variable z = t — /3 kj , there exists some positive constant c such that 


fix = 


exp{—T 2n ^ 2 /2 - T n , p /(2 + 0 k j) 2 }dz 


< exp{—T 2n e n /4} / exp{-T 2n z /4}ofe 

J — OO 

< c ^ 2 n exp{ -c 6 T 2n r 4/2 }, 


for some constants c 5 and c 6 , since e n x (rWk*/T n ,p) 4 / 4 . Then, 

^(Acj) < c 3 T 2 “ 1 exp{-T 2n (2r n)P ) 1/2 } + exp{-T n>p /(/? k j + e n ) 2 } 

+c 5 r 2 " 1 exp {—c 6 T 2n r ? 4 ^ 2 }. 

Since e n x (n^/T,^)" 1 / 4 , when < e n , T UtP /0k,j + e n ) 2 < r nj) /( 4e^) x T 2n Tn( p , and 


3 



when /3 kJ > e n , r n . p / (/3 kj + e n ) 2 < r„ iP /(4^ •) < T 2n Tn,p. In overall, the right-hand side 
of the above display would be dominated by the second term, which shows that t/ 2 (Acj) < 
cT^ 1 exp{—T np /(/9 kj + for some constant c. When /3 kj - < 0, we can show the same re¬ 
sult by following exactly the same steps explained above. □ 

We now present some auxiliary results that are used to prove Theorems 1 and 2. We make use of 
the following simple union bound multiple times: for non-negative random variables Vi,... ,V m 
and a > 0, 

m m 

p (E v ‘ >«) P(Vi > a/m) < m max P(Vi > a/m). (S3) 

/ i i=i 

We define some notations that are used in the subsequent proofs. Let t denote the true data 
generating model, and let 3/ denote the true regression coefficient corresponding to t. Let c t = 
t \ k, c k = k \ t, and u = k U t. Also, we define the cardinality of a model k as k and in the 
same spirit, denote Ck = |c k |, c t = |c t |, and t = |t|. {a;} ; denotes the j-th element of the vector 
x, and diag{A}j refers to the j-th diagonal element in the square matrix A. We denote XmW a 
non-central chi-square distribution with the degrees of freedom m and non-centrality parameter A; 
a central chi-square distribution is simply denoted by y 2 r 

An important property that is used in the subsequent proofs concerns the distribution of the 
marginal ridge estimator. Let /3 k = (X£X + l/T„ -?j / k ) 'A" k y and A kj = {/j k } ? . Then, 


/3kj ~ N(3 


k ,ji 



(S4) 


where ,6^ = {(X£X + l/T rup I k r 1 X'/X tl 6;} ] and = a 2 dzag{( X£X k + l/r^I^jj. It is 
also evident that (f3' k j - 3 k j) 2 / cT k j ~ Xi- 
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A set of technical results follow that are used in the proof of the main results. Define 


H\n 


m k (y)n(k) 


E 


|k|<5„ |k|<g„ 


^( k I y) 7T _ 
-fJ2n — 


m k (y)7r(k) 


|k|<g„ 


7r(k | y) 
*<*1») 

|k|<(?n 


(S5) 


Lemma 2. Fa e > 0. Let = {k : |k| < q n , t C k, |k| — |t| = d} for d = 1 ,q n — |t|. 
Suppose there exist constants c, 5 > 0 such that max ke r d P{vr(k | y)/ 7r(t | y) > ep~ d /q n } < 


cp 


— c?(l+<5) 


for d = 1 ,,q n — |t|. Then, II\ n converges to zero in probability as n tends to oo. 


where II \ n is as in ( S5). 


Proof Clearly, |r rf | = ( p J^). Using ( |S3| ), we bound 


p{ y. * (k 1 - > < 


k:tC k 


7r(t | y ) 


Qn |t| , \ 

p{ E E™>^ 


d =i ker d 


vr(t | y) 


Qn |t| ,, | \ 

< E' E=£M>'/* 


d= 1 

?n~|t| 

s E 

d= 1 


ker d 


vr(t | i/) 


Finally, C L d<S — c <2ViL 5 —» 0 as n —>• oo. 


c t} for k = 0,...,q n ] c k — Q,...,k] c t = 1,..., t. Suppose 


] maxPi 

7r(k 

1 y ) 

7r(t 

1 y) 

J ker d l 

• oo. 



\c k ,c t = {k 

: |k| 

VI 


Qn |t| 


cp 


—dS 


d= 1 


□ 


max P 

k 




7r(k | y) 
Lvr(t | y) 


> en 3 p k n Ck t t 


< cp~ k( - l+5 \ 


with some postive constants c and 5. Then, H 2n converges to zero as n tends to oo, where II, n is 


as in (S5). 
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Proof. Clearly, | T 


©(‘JO- 


k,c k ,ct 


r ^ 7r(k I y) 

1 J? k nit i 



Qn k t 

* ^{EEE E 

^^ i ,cf 

Qn k t 

s ^{EEE E 

^—1 Cfc 0 c-t 1 kGr fc , Cfc)Ct 

Qn k t 

* EEE P { E 

k —1 c/j Oct 1 
Qn t 


7r(k 

1 y) 

vr(t | 

y ) 

7r(k 

1 v) 

vr(t | 

y) 

7r(k 

1 y) 

vr(t | 

y) 


> e 


> e 


-3 


> en 


< S' y~ V' p k n Ck t t max p\ "J ^ > en 3 p k n Ck t * j 


k =1 Cfc=0 ct = l 
Qn k t 

s EEE p k n Ck t t p~ k ( 1+5 ' > ->■ 0, 

fc=l c^=0 ct = l 


as n —>• oo. 


□ 


Lemma 4. Suppose W follows a non-central chi-square distribution with the degree of freedom 
m n that is a positive integer and the non-central parameter X n > 0, i.e, W ~ Xm„ (A„). A/ 50 , 
consider w n and t n such that w n —? 0 t n —>■ 00 as n tends to 00 . Also, assume that m n -< t n . 
Then, 

P(w < Kw n ) < ciA" 1 exp{—A n (l - w n ) 2 }, (S6) 

And 

f t \ mn/2 f f 2 1 

P(1C > A n + t n ) < c 2 ( ex P {™>n /2 - /n/2} + c^l/Hf 1 ex P I32A - J ’ ^ 

where c\, c 2 , and c 3 are some positive constants. 

Proof. W can be expressed as W = Y1T= i(^ + (An/mn) 1 ^ 2 } 2 , where Z* 7V(0,1) for i = 
1,_, m. Then, by the fact that P(Z > a) < (27r) _1 / 2 a _1 exp{— a 2 /2} for any a > 0, we can 
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show that there exist some positive constants c\ such that 


P(W<\ n W n ) = P{Y J Zl + 2{\ n /m n ) l / 2 Y J Zr + \n<\nW n } 

i —1 i =1 

m n 

< P{m- 1/2 J2 Z i < -Ay 2 (l -w n )/2) 


i= 1 


= P{ |Zil>Ay 2 (l-u.„)/2}/2 
< CiA; 1 exp{-A„(l - w„) 2 /2}, 


since Z\ follows a standard normal distribution. 

Also, by using Chernoffs’s bound and the fact that P{Z > a) < (27r) _1 / 2 a _1 exp{—a 2 /2} for 
any a > 0, one can show that 


< 


< 


P(W > A n + t n ) = p\Y J z -+ 2(A n /m n )P 2 Y J Zi>P 


i= 1 


2—1 


P E ^ > ‘”/ 2 + P \ E Z ‘ > K' ,2 tn/4 


C 2 


„ 2 — 1 


2m r , 


2—1 


m n / 2 


exp 


{m n /2 - t n /2} + c$)^l 2 t n x 


exp 


32A r 


where c 2 and c 3 are some positive constants. □ 

Lemma 5. Consider O k defined in ([6]) /or an arbitrary model k. Fix any 6 > 0. For any k with 
tCk, 


^ [Qk/Qt > exp { — |k \ t| T 2 P(no k *) 1/3 + |t|T^/ /8 (no k *) 5/8 }] < p 


-|k\t|(l+5) 


(S8) 


and for k such that t ^ k, 


P [Qk/Qt > ex P {||/?t ll^^u*/ {2 log( r n,p/ l°gp)}}] < V |k|(1+<5) . (S9) 
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Proof. By Lemma |T| it is sufficient to show that 


p II> ex p(l t K/ / 8 (^k*)' 5/8 } 

JG t 

< p -|k\t|(l+«5)_ 


+ P 


II Qkj > eX P{-| k \ t l r n, / p 3 ( nZ/ k*) 1/3 } 

j'Gk\t 

(S10) 


We first shall show that the first term in the left-hand side of (jSTOj) is bounded above by 
exp{—cru'k*} for some constant c. 


P 


n 

bet 


> 


ex p{l t l T L i/8 ( w '»)* /8 } 


= E p 

jet 

< E p tiA,j 

jet 


9m 

Qt* 

, (+ l/r n , x _1/2 


< 


jet 




k,J 


% 


>exp«// 8 (n^) 5/8 } 


nu* + 1/r, 


n,p 


exp | —T niP (l/{\Pkj\ + e n ) 2 - 1/^k 2 ) } > exp {r^/ /8 (n^) 5/8 } 

(Sll) 


jet 


for some small enough e' > 0 and some positive constant d and fif.- e [/3k j — f n , /3k,j + b] \ 
(— e n ,e n ) c as defined in Lemma 1 and /3 kj and /3£ ■ defined in ( fS4] ). The last inequality in the 
above display asymptotically holds, since 


i p S/ \nu k *) 5/8 b T n J (1/3^1 - e' - e n ) 2 , 


one can show 


for any 5 > 0. 

Since (/3 k y - /3£j) 2 /°kj ~ Xi and > (™^k* + l/r niP ) _1 , by using Lemma ■ 
that the first term in ( |S11| ) bounded above by exp{—cie' 2 nzj k *} for some constant c\. Similarly, the 
second term in ( |S11[ ) is bounded above by exp{—c 2 e /2 n} for some constant c 2 , since Assumption 


5 states that XfX t /n is asymptically isotropic. Therefore, (]S 11 [) is asymptotically bounded by 


P 


-g„(l+<5) 


by Assumption 3. 


Next, we shall show that the second term in the left-hand side of (S10) is bounded above by 
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exp{— CTn,p (nvk*) 2 / 3 } for some positive constant c. Since when j e k \ t and tCk, /3£ ■ 


n 


-i 


P 


n > eX P{-| k \ t l r ni 3 (^ Z/ k*) 1/3 } 
jek\t 


jek\t 


c'invkj + l/r n , p ) 1/2 exp 


T, 


n,p 


(l^kjl + e n y 


> exp{-r 2/ p 3 (n^) 1/3 } 


jek\t 

* E p 

jek\t 


P Pi, 3 > 1 T n,p (( nzy k*) 1/3 r 2/ p 3 - log(ra/ k * + l/r n ,p)/2 + log d) 


/\-i/2 


- e r 


(Ac,j - ^-)7^ > * 


{ Tn >p A 
V^W 


1/3 


(f'M'k* + 1 /r n:P )/a 2 


for some positive contant d and c". Since (/3kj — ftk j) 2 / a tj ~ xl by Lemma |4| the last quantity 
in the above display can be bounded by exp{— CTn : p [rw^ ) 2 / 3 } for some contant c. By Assumption 
3, exp {—CT^p (rw k *) 2 P} -< p~i^ 1+s ') < pl k \ t l( 1 + 5 )l ; which proves the statement ((STOj). 


We now shall show that the equation (JS9J) holds for any <5 > 0. The left-hand side of flS9| ) can 
be bounded above by 


P 


n ( II Qt) > eX P { \\Pl\\l nU »*/{ 2 lQ g(W logp)}} 


-jek 


je t 


< TP c(nz/ k * + l/r n , p ) 1/2 exp {-T„ iP /(|/3 ki j| + e n ) 2 j > exp {\\dl\\ 2 2 nu Uif /{A\k\ log(r n , p /logp)}} 


jek 


y P d(nvt* + l/r nj p) 1/2 exp |r n)P /(^ 2 )| > exp {11/5°IIl^^ u */{ 4 |11 log(r n , p /logp)}} 

jet 


e E p 

jek 


T, 


n,p 


‘ > HAll2^m/{ 4 l k l lo g(WlogP)} +l°gC 

(| Pk,j | “1“ ^n) 

+ [fe-l < c, 1lAll2 1 (^u*)" 1/2 { 4 |t|log(r n , p /logp)} 1/2 T 3 ) / p 2 

jet 

where c, d, and d' are some positive constants. 


(SI 2) 
(SI 3) 


( |S12[ ) is always zero since the left-hand side in the probability is always negative and the right- 
hand side in the probability operator is always positive. So, we focus on ( |S13[ ) as below: 


Since /3 t ,j ~ e n < j3l,j < Ptj + e n implies |/3 tj | - e n < < \/3 t ,j\ + e n , ( |S13| ) can be 
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bounded above by 


P < c/, H/ ? t°ll 2 1 (^u*) 1/2 {4|t|log(r niP /logp)} 1/2 r^ 2 


je t 


< P < cl^tl^fVu*) 1/2 {4|t|log(r n , p /logp)} 1/2 rV p 2 + ' 
jet 


where <3(j is defined in ( [S4] ). Since P‘tj/cr‘t :j ~ an ^ x /n for J £ t, by 

using Lemma [4] and Assumption 5, one can show that the probability is bounded by exp{— cn} for 
some constant c, and it is evident that exp{— cn} -< p _ l k l (1+ ' 5 ), which completes the proof of the 
Lemma. □ 


2 Proofs of Main Results 


Proof of Theorem [TJ We have 7r(t | y) = m t (y)7r(t)/{^ k .| k | <qn m k (y)7r(k)}, since 7r(k) = 0 
for any k with |k| > q n . Recall II ln and H 2n from ( [S5] ) and note that 7r(t | y) — (1+H ln +H 2n )- 1 . 
Hence to show that 7r(t | y) converges to one in probability, it is sufficient to establish that II\ n 
and H 2n both converge in probability to zero as n tends to oo. We shall prove the Theorem by 
showing: 

For any 5 G (0, 8/3) and any model k G Td (defined in Lemma[2]), 


7 r(k | y) 
vr(t | y) 


> ep 



< p~ d{ - 1+& \ 


(S14) 


and for any model k G r k,c k ,ct (defined in Lemma[3]), 


7 r(k | y) 
7 r(t | y) 


> en 3 p k n Ck t 1 


< cp- k ^ 1+5 \ 


(S15) 


Then, it is evident that Hi n and II 2n both converge to zero in probability by Lemma [2] and [3] 
respectively. 
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First, we shall show that (|S14[) holds. For any k 6 T* recall that 


P 


* (k 1 a) > £ p-v‘ 

ir(t V) P q " 


< P 


n ’ p Qt P 


^(i? k - P. t ) } > ep d /q n 


Since i? k > and /i t < /?,£ + rj, where // = dip[/3 t /T n , p for some constant <7, and /? t is the 


ordinary least square estimator of /3 t in the true model t, by using ( [S3] ), the term in the last display 
can be bounded above by 


y-dQ k 


C n d p-Q~ t ex P { - (K ~ R*t)/(.2v 2 ) + v /{ 2(j2 )} > 7 Qr 


P 

< P 

+P [P* t -R* k > 2a 2 d(l + S) logpj 
+P [exp{p/(2cr 2 )} > ep 5 ] . 


fi-dQk d(l+6)+6 -d/ 

L 'n,p qJP > e P /Qn 


(516) 

(517) 
(SI 8) 


By using Lemma [ 5 J ( |S16[ ) is less than p~ d ( 1+s ') when 5 < 8/3. Since (Rl — RD/a 2 ~ xf k \t|’ 
using ( |S6| ) in Lemma [ 4 J we can show that ( |S 1 7[ ) is bounded by C p- d ( 1+<5 ) for some positive constant 
c. Since T ri ^w u rjj d\ 0 2 < X^ X t 3 t /a 2 ~ X\ t \ X t [3() , by using the inequality ( |S7| ) in 

Lemma [4] ( |S18| ) can be expressed as 


P [exp {p/2cr 2 } > ep 6 ] < P [r n}P nu u r) / dio 2 > 2r niP nz/ t * (log e + <51ogp)/di] 

< P\Pt x t x tK/cr 2 > 2r niP ni/ t *(loge + 51ogp)/dJ 

< (nS logp)^^ 2 exp{— c\8{n logp)} + n _1 / 2 (<5 logp) -1 exp{—c 2 (n logp) 2 /n} 

< c 3 p- |k|(1+5) , (S19) 


for some positive constant c\, c 2 , and c 3 , which proves that (S14) holds. 
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Next, we consider (|S15[). Recall that u = k U t. By using (S3), it can be shown that 


P 

< P 

< P 


> en“V |k| ^ |kVt| |t| 


-?r(t | y) 

c -(|k|-|t|)Qk 
L 'n,p ^ 

Vt 

'c-w-i 


{-(*. 


Vt *• J 

'*“^exp{-(K-K)/(2<T 2 )}> 

Vt 


R t )/(2o 2 )J > en-V |k| n- |k \ t| |t|- |t| 

-3-|k\t|| t |-|t|^-|k|(2+5)- 


n 


n,p —i i ~k ~ ~u/ / v— ' j - - i - i X- 

Vt 

;p |k|(1+5) ] +P [ exp ( v /(2a 2 )) 


> ez 


+P[exp{(Ft;-Rl)/(2a 2 )} ^ 

< P[exp{(fl;-K)/2ff 2 }>£p' k " 1+i >] 
+P [ exp (r//2a 2 ) > //] 


> p 


+P [P k - P* < 2(T 2 ||/3 t °||^m/ u */log(r rii p/logp)] 

+P [Qk/Qt > exp {||/3t || 2 ni/ u*/{2 log(r njP / logp)} }] . 


(520) 

(521) 

(522) 

(523) 


Since (P{ — P*)/cr 2 follows a xf u \ t | distribution, ( |S20[ ) is also bounded by cip^ k ^ 1+s ' > with 
some constant ci. By following the same steps regarding ( |S19[ ), one can show that ( |S21| ) is 
bounded by c 2 £H k ^ 1+<5 ) for some constant c 2 . We note that (P£ — P*)/cr 2 ~ x?\ k i(A n ) with 


A n = (P u — Pk)9f t /3°, where Pk is the projection matrix of X k . As discussed in Narisetty 


and He (2014), A n > nz/ u *||/5t Hi- Hence, by using Lemma[4j one can show that < |S22[ ) is bounded 
by exp{—c 3 1|/3°Ulrzz^u*/log(r n)P /logp)} for some constant c 3 . Lemma [5] states that ( |S23[ ) is 
bounded by p~l k K 1+<5 ). In summary, since q n -< r n)P /logp by Assumption 3, there exists some 
positive constant c 4 such that P[7r(k | y)/ 7r(t | y ) > en _3 p _ l k ln _ l k \ t l < C 4 j)~^ 1+S \ which 

completes the proof of Theorem [I] □ 

Proof of Corollary [2] Recall the penalty term of a model k, Q^. based on the piMoM priors is 

r |k| l k l 

Qt = / ex P { - (A< - 3k) T Sk~ 1 (/^k - Ac)/(2 cr 2 ) - Y T n , p /P kj - r Y lo g(^kj)}^ k ’ 


j =i 


3=1 


in ([7]). Since, for any e > 0, exp [ — Yjti{ eT n,p/Pk,j+ r l°g(/5 k ,)}] is bounded above with respect 
to 0k j, Q*k<C J exp { — (,5k - Ac) T Sk _1 (/3k - Ac)/(2cr 2 ) - £^(1 - e ) T n, P / Pljjd/3 k for some 
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constant C. Following the exactly same steps in Lemma [lj Qt < C'invZ)- 1 /* nl k i 2 exp{ —(1 — 
e ) T n, P / (\/3k,j | + A) 2 } for some constant C > 0. 

We shall show that the model selection procedure based on piMoM priors as in ([4]) assures 
consistency by proving that Q k and Q k are asymptotically equivalent. 

Next, we shall show that is bounded below by C(nu^)~ 1 / 2 exp{ — (1 — e)T n . p /A 2 } 

for some constant C > 0 and A[ :J G [Aj - e n , A j + A]- Since exp { - er njP /^ kJ + r log(A^)} 
can be minimized in [Acj — A, Aj + A], by following the proof of Lemma[lJ 

/ OO 

exp{-nz/£(/3 - 3k ,j) 2 /(2a 2 ) - T n , p //3 2 - r log (/3 2 )}df3 

-OO 

rPkJ+tn ^ 

> exp {-nv£(P - /3k,j) 2 /(2a 2 ) - (1 - e)r n , p /(3 2 } exp {-eT n , p //3 2 - r log (/3 2 )}d/3 

•'Ac ,j~ e n 

> C(nv^)~ 1/2 exp |-(1 - e) r n, P /3k 2 j} , 


where C is some constant and ■ G [/3 kj — A, Acj + A] \ (—A,A) c - 

Therefore, due to the asymptotic similarity between the ridge estimator and the least square 
estimator, the lower and upper bounds of Q k are asymptotically equivalent to those of Q k with the 
penalty parameter (1 — e)r n , p , which assures the strong consistency of the model selection based 
on the piMoM priors. □ 

Proof of Theorem [3l Under a situation where a 2 is unknown, it is clear that 


.M 


m k (y ) = T n ,p 2 / (2na 2 ) 2 


exp 


lk „l 


1/2 


(A 


-1 


A0 T £ k 


(A — A 


|k| 


3=1 


2cr 2 


where 7r(a 2 ) is the prior for a 2 (Inverse-gamma density with hyperparameters ao and b 0 ). 

First, we shall show that the ratio between marginal likelihoods of a model k and the true model 


2 )d/3kda 2 , 
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t can be bounded as 


rnJjj) 
m t (y) 


( Rk + 26 o ^ 
\ -Rt + 2&o / 


—n/2—ao 


exp 


|k| 

£ 


r, 


n,p 


= 1 ( \fikj | T ^n)‘‘ 



(^k*T n ,p + 1) |k|/2 

(nu*Tn }P + l)-!*!/ 2 ’ 

(S24) 


where /3£ ■ G [/3 tJ - — Li, Aj + Li] \ (—?n, L,.) c for j G 1,..., |t| and c is some constant. Next, we 
shall show that {(f? k + 2& 0 )/(-Rt + 2b 0 )}~ n ^ 2 ~ a ° < exp{ — (R k — R t )/(2ol(l + u n ))}, where Oq 
is the true regression variance that involves in the data-generating process, and u n is some random 
variable that is concentrated around a finite value with at least probability 1 — exp{— cn} for some 
constant c. Then, by following the same steps in the proof of Theorem [I] the proof of Corollary [2] 
is completed. 

By Lemma [I] the marginal likelihood of a model k can be bounded by 


|k| f 0 n+ 2a 0 . I f 2 \ ^ J—L, 

m^(y) < {ci{nv^T n ,p + 1)} 2 J (a ) 2 exp < |k| J - } 


T, 


n,p 


= 1 ( \fik,j | T 


Rk + 2&o 

2cr 2 


|k| 


T, 


n,p 


< {ci(m/ k *r n)P + 1)} "a" exp { - V - _ 

j=i (I Pk,j | + e n ) 


(1 + exp{2|k|}) (Rk + 26 0 


n+2ag 

2 


for some constant ci. 

Also, by using Lemma |TJ one can show that 


m k (y) > {c 2 {nv^T n , p + 1)} 2 / (cr 2 ) 2 exp <( |k 


, . , oX 1 / 2 l k| D 1 Ok 

_1M / n L 2a ° 1 J 11,1 ( Z T n,p ilk + f ^2 


cr- 


/% 2 2cr 2 

7 = 1 k,1 


|k| 


> {c 2 Ai,p + 1)} 2 exp < - ^ > (f? k + 2 h 


o*‘z 

J=1 p k. 


*2 

J 


n+2aQ 

2 


where c 2 is some constant and /3£ ■ G [/3 kj — e n , /3 k y + e n ] \ (—e n , £n) c for j G 1,..., |k|. These 


results shows that (S24) holds. 


Next, we consider the asymptotic behavior of {(i?k + 2b 0 ) /( R t + 2b 0 )} n R a ° in ( |S24| ). Define 


do 2 
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p n as the follows: 


pn = (Rt + 2b 0 )/(na%) - 1 . 

Since — log(l — u) < u/(l — u ) for «el, 

— log{(-Rk + 26 0 )/(f? t + 26o)} = ~ l°g[l + (-Rk ~ + Pn)°o}] 

< (R t - R^)/{nal(\ + u n )}, 


where u n = p n + (P k - Rt)/(nal). 

Since (P£ - R* u )/a$ ~ X|u\k|(A„) with X n = /3^ T X//(P U - P k )X t/ 3 t % rg, by using Lemmag 
one can show that 

P(\u n -X n /n\>e) < P(\p n \>e/A) + P{(R* t ~R* u )/(na 2 0 )>e/A} 

+P {|(p;- R* u )/(na 2 0 ) - K/n\ > e/4} + P {p/2 na 0 2 > e/4) 

< exp{-c'n} + P {| (R k - R* u )/(nao) - X n /7i\ > e/4} 

< exp{— c"n}, 

for some constant d and c", and p is defined in the proof of Theorem[I] Also, by Assumption 5, 
A n /n will be bounded below and above. □ 

Proof of Corollary |4j Since we showed that the asymptotic equivalence between Q k and Q k in 
the proof of Corollary [2j by following exactly same steps in the proof of Theorem [3] we can prove 
the model selection consistency under piMoM prior densities. □ 

Proof of Proposition [5] We shall show that for any o k = /3 k + e n with e n = {e n ,j}j=i .|k| and 

\e n j\ y e* for at least one j e (1,..., |k|}, P{g(a k ; k) < g(/3 k , k)} —* 0 as n tends to oo, where 
G B(/3 k ; e*) with e* x (r^p/n) 1 ^ 3 . More specifically, we set = /3 kj - + e* for jet and 
(3 £ • = /3 kj for j e t c . Without loss of generality, we assume that XJXj = n for j = l,... ,p. 
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Note that 


q{ol k ;k) 


|k| 

||X k a k -X k /3 k ||2 + p/\ a V.,j\ + D, 

3 =1 

|k| 

^ ^ j T Ti,p/1 Pk,j “I - Cn j | } “I - D n , 

3 =1 


for some constants such that Cl < Cj < Cjj for j = 1,..., |k|, and some randome variable D r 
that are not relevant to o k . Then, 


P{g{a k;k) < g{PZ; k)} 

|k| 


< P 


< P 


+ P 


+ P 


|k| f ) l k l f ) 

X \ C 3 ne n,3 + | X | , P | 7 ~ tn ’3 } < X { 

j'6S*n5 k , n l \PKj\ ■+■ l e «dl J ies*n5 k ,„ l 

X (cjwe^-+ ^ Tn ’ p -t„A < X I 

?*nS£ 1 \(^Kj\ + \ e n,j\ J j£S*nS? „ \ 


*2 , 'n,p 
c j ne n + 


l/^'l 


i65 


E 

b'65« 


*2 , 
Cj-ne^ + 


1/3; 


k,jl 


2 , 1 n,P 

C j ne n,j + 


l^kj'l + \ e n,j\ j£S* ^ 


X < X C P le n + 


*2 , T n,p 


JSS* 


1/5, 


kj'l 


(525) 

(526) 

,(S27) 


where t n is an arbitrary sequence such that t n j = n 2 / 3 Tn,p€ n j, and S* — {j G {1,... ,p} : \e n j\ y 
e*}, and ,5' k . r) = {j e k : |/3 kj | < e* }. Then, to complete the proof, it is sufficient to show that 
each of ( |S25[ ), ( |S26| ), and ( |S27| ) converges to zero. 

Since n0 k j - Pt j)V^ 2 ~ xl for j = 1,..., |k|, 


> Cn) < (7r</2)-^exp{-</(2a 2 )}, 


for any Cn > 0. This implies that S^ jn = t at least probability 1— |t“|( TO£ ; 2 /2)- 1/2 exp{-< 2 /(2<T 2 )}. 
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Therefore, the equation (|S25|) can be asymptotically bounded by 


E * 


jeS*nt 


C J ne nJ + 


n -P 


t n ,j < CjTie* 2 + 


21 e 


r, 


n,p 


n,j | 


|/?k ,j + £ 


— ^ ^ I* l/^kj T ^nl ^ n,j T Ai,p/|^n,j|) 

jeS*nt 


for some positive constant c. Consider Lemma[4]with A n = ne* 2 / a 2 and w n = c 2 r 2 p /{e* 2 (ne 2 ^ — 
t n ,j + T n>p /|e n j|) 2 } for j e S'* fl t. Since ne 2 j >- n l / ?J Tn.p for j e S'* implies —>• 0, Lemma [ 4 ] 

guarantees that the last display is bounded by c'|S'* n 11 A” 1 exp{—A n (l — w n ) 2 } for some constant 


d , which means that (JS25J) converges to zero as n tends to 0. By following the same steps, one can 


show that (S26) converges to zero. 


Also, (S27) can be asymptotically bounded by 


E * 

j£S* c nt 


2 1 ^ *2 , Ai ,p 

Cj7ie n . H—:- r + c mm t n ,■ < c,-ne + 

J 21 | jes* 


“",3 I 


+ e* 


E * 

jeS* c r\t c 


Cjne nJ + 


T. 


n,p 


+ c min t n j < Cjne* 2 + 

2|Am+4l l/Vj+^l 




< P I + e n I < C Vp ( ne L- - ne n 2 + C f n,j + Tn,p/ \ tnj I) 1 

jeS* c nt L J 


E ^ 

jeS* c nt c 


13k j + e* n \ < d'T 7hP (ne 2 n j - ne* 2 + cminf nj - + r n)P /|e nii |) V 2 

jES* 


where c, c', and c" are some positive constants. For the first term in the last line of the above display, 
by setting A n = ne* 2 /cr 2 and w n = c 2 r 2 p /{e* 2 {ne 2 nj - ne* n + cmin je5 * t n j + T nj> f |e nJ |) 2 }, we can 
apply Lemma |4j Since w n -< r 2 p (e* n 1 i 1 i ye t n j)~ 2 implies w n — > 0, the first term in the above 
display converges to zero by Lemma |4j Similarly, the second term also converges to zero. □ 
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3 Laplace Approximations of Marginal Likelihoods 


In this section, we provide the Laplace approximation of the marginal likelihoods based on the 
nonlocal priors. Because closed form expressions for posterior model probabilities based on mod¬ 
ified peMoM priors and modified piMoM priors are not available, we estimate the posterior model 
probabilities using Laplace approximations. For posterior probabilities based on the peMoM pri¬ 
ors, an inverse-Gamma density with parameters (a 0 , b 0 ) on a 2 the Laplace approximation to the 
marginal density of the data for model k can be expressed as 

vr(k | y) oc (2vr) |k|/2 \V(P k , a 2 *)\~ 1/2 exp{/(0£, cr 2 *)}p(k), (S28) 


where 

(X,ct 2 *) = argmax/(/3 k ,a 2 ) 

Gk.o' 2 ) 

f(P k, v 2 ) = ~ in /2 + |k|/2 + a 0 + 1) log a 2 - (y - X k (3 k ) T (y - X k fi k )/{2a 2 ) - 0£/3 k / (2fx 2 r niP ) 
]k| 

~^2 T n,p/Pl,j + l k l( 2 /^ 2 ) 1/2 - b o/° 2 + |k|(logT n>p )/2, 

3 = 1 

and V(f3 k , cr 2 ) is a (|k| + 1) x (|k| + 1) matrix with the following blocks: 

V n = X£X k /cr 2 + I k /cr 2 T n>p +diag {6T n>p //3£j } j=h .. 

V u = X£( : y - X k /3 k )/a 4 - /3 k /{a 4 r njP } 

V 22 = -(n/2+\k\/2 + a 0 + l)/<J 4 + (y-X k /3 k ) T (y-X k /3 k )/a 6 -/3 k ^ k /r njP 
-3|k|2 1/2 cr- 5 /4 + 2 b 0 /a 6 . 
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For the piMoM priors on Ac, the Laplace approximation of the posterior model probability can be 


expressed as in (S28), but with 


/(A<, cr 2 ) = ~ ('n/2 + a 0 + 1) log cx 2 - (y - X' k f3 k ) T (y - X k f3 k ) / (2o 2 ) 

|k| 

- ( r lo s(^kj) + r n,p/Pl,j} + l k l{( r - V 2 ) log Tnj, - logT(r - 1/2)} - b 0 /a 2 , 

3 = 1 

and V(/3k, a 2 ) a (|k| + 1) x (|k| + 1) matrix with the following blocks: 


Ai = X£X k /a 2 + diag{6r ntP //3l j - 2r//3 2 j } j=1 |k| 

V 12 = X£(y-X k/ 8 k )/a 4 

V 22 = —(n/2 + a 0 + 1)/ a 4 + (y — X k f3 k ) T (y — X k (3 k )/ a 6 + 2b 0 / cr 6 . 
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